# Run before lecture to load datasets and do simple prep
library(tidyverse) #all our data wrangling/plotting
options(repr.matrix.max.rows = 6)
# Making scatter points a bit bigger so that students can see them
update_geom_defaults("point", list(size = 3))
#Mauna Loa
co2_df <- tibble(
    concentration = as.vector(co2),
    date = lubridate::date_decimal(as.numeric(time(co2)))
)
#Top 12 Island landmasses
islands_df <- enframe(islands)
colnames(islands_df) <- c('landmass', 'size')
islands_df <- top_n(islands_df, 12, size)
continents <- c('Africa', 'Antarctica', 'Asia', 'Australia', 'Europe', 'North America', 'South America')
islands_df <- mutate(islands_df, is_continent = ifelse(landmass %in% continents, 'Continent', 'Other'))
gapminder <- read_csv("data/gapminder.csv")
gapminder_2016 <- gapminder |>
    select(country, year, continent, life_expectancy) |>
    filter(year == 2016)
#old faithful, mtcars -- nothing to do
Rows: 10545 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, continent, region
dbl (6): year, infant_mortality, life_expectancy, fertility, population, gdp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
DSCI 100 - Introduction to Data Science
Lecture 4 - Data visualization in R
Attribution: images in these slides that are not accompanied by code mostly come from
The Fundamentals of Data Visualization by Claus O. Wilke

Artwork by @allison_horst
Designing a visualization: ask a question, then answer it
The purpose of a visualization is to answer a question about a dataset of interest.
A good visualization answers the question clearly. A great visualization also hints at the question itself.
Visualizations alone help us answer two types of questions:
- descriptive: What are the 7 largest landmasses on Earth?
- exploratory: Is there a relationship between penguin body mass and bill length?
The other types of questions (inferential, predictive, causal, mechanistic) need more tools in addition to visualizations.
Creating visualizations in R
It's an iterative procedure. Try things, make mistakes, and refine!
We will use ggplot2. There are three key aspects of plots in ggplot2:
- aesthetic mappings: map dataframe columns to visual properties
- geometric objects: encode how to display those visual properties
- scales: transform variables, set limits
Add these one by one using `+`.
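As a minimal sketch of this layering, the built-in mtcars data can be plotted in three steps. The object names `base`, `with_points`, and `with_scale` are illustrative only, and the log scale is just one example of a scale.

```r
library(ggplot2)

# Step 1: aesthetic mappings only (draws empty axes, no data shown yet)
base <- ggplot(mtcars, aes(x = hp, y = mpg))

# Step 2: add a geometric object to display the mapped columns
with_points <- base + geom_point()

# Step 3: add a scale to transform the x axis (log10 here, as an example)
with_scale <- with_points + scale_x_log10()

with_scale
```

Each `+` returns a new plot object, so intermediate steps can be inspected while iterating.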
Types of variables
A variable refers to a characteristic of interest and can be:
- categorical: can be divided into groups (categories) e.g. marital status
- quantitative: measured on a numeric scale (usually units are attached) e.g. height
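One way to check how a column is stored in R is sketched below, using the built-in mtcars data. The name `cyl_groups` is illustrative; treating `cyl` as categorical is a modelling choice rather than something R decides automatically.

```r
# hp (horsepower) is quantitative and stored as a number
is.numeric(mtcars$hp)

# cyl is also stored as a number but only takes the values 4, 6 and 8,
# so for plotting it is often treated as categorical via factor()
cyl_groups <- factor(mtcars$cyl)
levels(cyl_groups)
```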
Scatter Plots
To visualize the relationship between two quantitative variables
e.g. Is there a relationship between horsepower and fuel economy of an engine? Does the number of cylinders affect that relationship?
# Load libraries for wrangling and plotting
library(tidyverse)
# Inspect the data
mtcars
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Ferrari Dino | 19.7 | 6 | 145 | 175 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121 | 109 | 4.11 | 2.78 | 18.6 | 1 | 1 | 4 | 2 |
# Set the default size for all plots
options(repr.plot.width = 10, repr.plot.height = 8)
# Is there a relationship between fuel economy and horsepower?
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
    geom_point(size = 3) +
    theme(text = element_text(size = 30)) +
    labs(x = "Horsepower", y = "Miles per gallon", color = "Num of Cylinders")
Line Plots
To visualize trends with respect to an independent quantity
e.g. How has atmospheric carbon dioxide changed over the last 40 years?
Mauna Loa Research Station
# Inspect the data
co2_df
| concentration | date |
|---|---|
| <dbl> | <dttm> |
| 315.42 | 1959-01-01 00:00:00 |
| 316.31 | 1959-01-31 10:00:00 |
| 316.50 | 1959-03-02 20:00:00 |
| ⋮ | ⋮ |
| 360.83 | 1997-10-01 18:00:00 |
| 362.49 | 1997-11-01 04:00:00 |
| 364.34 | 1997-12-01 14:00:00 |
# Change the default text size for all plots
theme_set(theme_gray(base_size = 26))
# How does atmospheric CO2 concentration change over time?
ggplot(co2_df, aes(x = date, y = concentration)) +
    geom_line() +
    labs(x = "Date", y = "CO2 Concentration (ppm)")
Bar Plots
To visualize the comparison of amounts
e.g. Which are the 12 largest landmasses on Earth? Are they all continents, or are some other large islands among them?
# Inspect the data
print(islands_df, n = 12)
# A tibble: 12 × 3
   landmass       size is_continent
   <chr>         <dbl> <chr>
 1 Africa        11506 Continent
 2 Antarctica     5500 Continent
 3 Asia          16988 Continent
 4 Australia      2968 Continent
 5 Baffin          184 Other
 6 Borneo          280 Other
 7 Europe         3745 Continent
 8 Greenland       840 Other
 9 Madagascar      227 Other
10 New Guinea      306 Other
11 North America  9390 Continent
12 South America  6795 Continent
# What are the 12 largest landmasses on Earth?
ggplot(islands_df, aes(x = size, y = reorder(landmass, size), fill = is_continent)) +
    geom_bar(stat = "identity") +
    labs(y = "Landmass", x = "Size (1000 sq miles)", fill = "Is continent?")
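`geom_bar(stat = "identity")` draws bars whose lengths are the values already in the data. ggplot2 also provides `geom_col()` as a shorthand for the same thing. The sketch below rebuilds a small version of the data from the built-in `islands` vector so that it runs on its own; `islands_top`, `bar_identity`, and `bar_col` are illustrative names.

```r
library(ggplot2)

# Rebuild the 12 largest landmasses from the built-in `islands` vector
islands_top <- data.frame(landmass = names(islands), size = as.vector(islands))
islands_top <- head(islands_top[order(-islands_top$size), ], 12)

# Two equivalent ways to draw bars from precomputed values
bar_identity <- ggplot(islands_top, aes(x = size, y = reorder(landmass, size))) +
    geom_bar(stat = "identity") +
    labs(y = "Landmass", x = "Size (1000 sq miles)")

bar_col <- ggplot(islands_top, aes(x = size, y = reorder(landmass, size))) +
    geom_col() +
    labs(y = "Landmass", x = "Size (1000 sq miles)")

bar_col
```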
Histograms
To visualize the distribution of a single quantitative variable
e.g. Was there a difference in life expectancy across different continents in 2016?
# Inspect the data
gapminder_2016
| country | year | continent | life_expectancy |
|---|---|---|---|
| <chr> | <dbl> | <chr> | <dbl> |
| Albania | 2016 | Europe | 78.1 |
| Algeria | 2016 | Africa | 76.5 |
| Angola | 2016 | Africa | 60.0 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| Yemen | 2016 | Asia | 64.92 |
| Zambia | 2016 | Africa | 57.10 |
| Zimbabwe | 2016 | Africa | 61.69 |
# Was there a difference in life expectancy across different continents in 2016?
ggplot(gapminder_2016, aes(x = life_expectancy, fill = continent)) +
    geom_histogram() +
    facet_grid(continent ~ .) +
    labs(x = "Life Expectancy (years)", y = "Count")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
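The message above appears because no bin size was given. A minimal sketch of setting `binwidth` explicitly follows. It uses the built-in `faithful` data so that it runs without the gapminder file, and the width of 5 minutes is an arbitrary choice for illustration; the same argument could be passed to `geom_histogram()` in the life expectancy plot.

```r
library(ggplot2)

# Each bar covers 5 minutes of waiting time; try other widths and compare
faithful_hist <- ggplot(faithful, aes(x = waiting)) +
    geom_histogram(binwidth = 5) +
    labs(x = "Waiting time (minutes)", y = "Count")

faithful_hist
```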
A few rules of thumb for creating effective visualizations
Rule of Thumb: No tables / pie charts / 3D

Which one is easier to interpret?
Pie graph:
- colours don't mean anything (unnecessary)
- hard to see size of slices relative to the other slices
Rule of Thumb: No tables / pie charts / 3D

- the third dimension does not improve the reading of the data
- these plots are difficult to interpret because of the distorted effect of perspective associated with the third dimension.
- 3D is discouraged for charts in general, and should only be used for very specific applications
- the bars or slices in a pie graph that are closer to the reader appear to be larger than those in the back due to the angle at which they're presented
Rule of Thumb: Use simple, colourblind-friendly colour palettes
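As one possible way to apply this rule in ggplot2, the sketch below recolours the earlier mtcars scatter plot with the ColorBrewer "Dark2" palette, which ColorBrewer lists as colourblind-friendly. The palette choice and the name `palette_plot` are illustrative; `scale_colour_viridis_d()` is another option.

```r
library(ggplot2)

palette_plot <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(cyl))) +
    geom_point(size = 3) +
    # Replace the default colours with a qualitative ColorBrewer palette
    scale_colour_brewer(palette = "Dark2") +
    labs(x = "Horsepower", y = "Miles per gallon", colour = "Num of Cylinders")

palette_plot
```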

Rule of Thumb: Include labels and legends, make them legible
Remember: a great visualization tells its own story without needing you to be there explaining things


options(repr.plot.width = 4, repr.plot.height = 4)
diamond_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point() +
    xlab("Size (carat)") +
    ylab("Price (US dollars)")
diamond_plot
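Labels also need to be large enough to read. A minimal sketch of increasing the text size on the same diamonds plot is shown below; the size of 20 is an arbitrary value to adjust for the display, and `legible_plot` is an illustrative name.

```r
library(ggplot2)

legible_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point() +
    labs(x = "Size (carat)", y = "Price (US dollars)") +
    # Increase the size of all text elements (axis titles, tick labels, legend)
    theme(text = element_text(size = 20))

legible_plot
```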